A bank needs a way to decide if and when a customer can be granted a loan.
A doctor may need to decide whether an oncological patient should undergo surgery or a less aggressive treatment.
A company may need to decide between investing in new technologies or staying with the traditional ones.
In all these cases a decision tree may provide a structured, data-based approach to decision-making that can be easily explained and justified.
A decision tree is a graphical representation of a series of decisions and their potential outcomes.
It is obtained by recursively stratifying or segmenting the feature space into a number of simple regions.
Each region (decision) corresponds to a node in the tree, and each potential outcome corresponds to a branch.
The tree structure can be used to guide decision-making based on data.
We need some context, and we need to learn how to build, evaluate and optimize trees.
Context
When is it appropriate to rely on decision trees?
When would other approaches be preferable?
What type of decision trees can be used?
Build the tree
The two main types of trees are Classification and Regression Trees (CART).
Both types of decision trees follow the same basic principles for constructing the tree, but the splitting criteria and the method for predicting the response variable differ.
Building the tree requires deciding how to select splits, when to stop splitting, and how to assign classes to terminal nodes.
Evaluation is similar to other classifiers.
Optimization involves deciding how, and how much, to prune the tree.
There are multiple packages for working with trees in R.
We will mostly use rpart and caret.
Check the document “CART_Examples.qmd” for a series of examples on how to build, validate and optimize trees with R.
```r
predicted.classes <- predict(model1, PimaIndiansDiabetes2, type = "class")
mean(predicted.classes == PimaIndiansDiabetes2$diabetes)
#> [1] 0.8294271
```
A better strategy is to use a training dataset to build the model and a separate test dataset to check how well it generalizes.
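As a minimal sketch of that strategy, the following R code fits on a training split and evaluates on a held-out test split; it uses the built-in iris data and the rpart package rather than the Pima example, purely for illustration:

```r
library(rpart)

# Split the data: 70% for training, the rest held out for testing
set.seed(123)
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit a classification tree on the training set only
model <- rpart(Species ~ ., data = train, method = "class")

# Accuracy is now estimated on data the tree has never seen
pred <- predict(model, test, type = "class")
mean(pred == test$Species)
```

The test-set accuracy is an honest estimate of generalization error, unlike the resubstitution accuracy computed on the same data used to fit the tree.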
A node is denoted by \(t\). We will also denote the left child node by \(t_{L}\) and the right one by \(t_{R}\).
Denote the collection of all the nodes in the tree by \(T\) and the collection of all the leaf nodes by \(\tilde{T}\).
A split will be denoted by \(s\). The set of splits is denoted by \(S\).
Keep in mind: the tree represents the recursive splitting of the space.
In the end, every leaf node is assigned with a class and a test point is assigned with the class of the leaf node it lands in.
The construction of a tree involves the following three general elements:
The selection of the splits, i.e., how do we decide which node (region) to split and how to split it?
If we know how to make splits (‘grow’ the tree), how do we decide when to declare a node terminal and stop splitting?
We have to assign each terminal node to a class. How do we assign these class labels?
To build a tree, questions have to be generated that induce splits based on the value of a single variable.
Ordered variable \(X_j\): questions of the form “Is \(X_j \leq c\)?”, for all possible thresholds \(c\).
Categorical variable \(X_j \in \{1, 2, \ldots, M\}\): questions of the form “Is \(X_j \in A\)?”, for all subsets \(A \subseteq \{1, 2, \ldots, M\}\).
The pool of candidate splits for all \(p\) variables is formed by combining all the generated questions.
Each candidate split is evaluated with a ‘goodness of split’ measure, which depends on the node \(t\) being split and the split \(s\) itself.
The ‘goodness of split’ is in turn measured by an impurity function.
Intuitively, when we split the points we want the region corresponding to each leaf node to be “pure”, that is, most points in this region come from the same class, that is, one class dominates.
[Figure: two candidate splits, one where purity is not increased and one where it is increased.]
Impurity functions measure the extent of purity for a region containing data points from possibly different classes.
Definition: An impurity function is a function \(\Phi\) defined on the set of all \(K\)-tuples of numbers \(\left(p_{1}, \cdots, p_{K}\right)\) satisfying \(p_{j} \geq 0, \quad j=1, \cdots, K, \Sigma_{j} p_{j}=1\), with the properties: \(\Phi\) achieves its maximum only at the uniform tuple \((1/K, \cdots, 1/K)\); \(\Phi\) achieves its minimum only at the ‘pure’ tuples \((1,0, \cdots, 0)\), \((0,1, \cdots, 0)\), \(\ldots\), \((0, \cdots, 0,1)\); and \(\Phi\) is a symmetric function of \(p_{1}, \cdots, p_{K}\).
Commonly used impurity functions:
Entropy: \(-\sum_{j=1}^{K} p_{j} \log p_{j}\). If \(p_{j}=0\), use the limit \(\lim_{p_{j} \rightarrow 0} p_{j} \log p_{j}=0\).
Misclassification rate: \(1-\max _{j} p_{j}\).
Gini index: \(\sum_{j=1}^{K} p_{j}\left(1-p_{j}\right)=1-\sum_{j=1}^{K} p_{j}^{2}\).
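As an illustration, the three impurity measures above can be written as short R functions of a class-probability vector (a sketch for intuition, not part of any package):

```r
# Impurity measures as functions of class probabilities p (p_j >= 0, sum(p) = 1)
misclass <- function(p) 1 - max(p)      # misclassification rate
gini     <- function(p) 1 - sum(p^2)    # Gini index
entropy  <- function(p) {               # base-2 logs; 0 * log(0) taken as 0
  p <- p[p > 0]
  -sum(p * log2(p))
}

p <- c(0.5, 0.5)   # maximally impure two-class node
misclass(p)        # 0.5
gini(p)            # 0.5
entropy(p)         # 1 (bit)
```

All three are zero on a pure node such as `c(1, 0)` and reach their maximum at the uniform distribution, as the definition requires.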
\[
\begin{aligned}
IG(X, F) &= \text{(original entropy)} - \text{(entropy after the split)} \\
&= H(X)-\sum_{i=1}^{n} \frac{\left|X_{i}\right|}{|X|} H\left(X_{i}\right)
\end{aligned}
\]
Since information gain subtracts the weighted entropy of the child nodes from the parent’s entropy, our choice of split will be the one that maximises the information gain.
The size weights \(|X_i|/|X|\) effectively keep small, pure nodes from dominating the criterion, just as we needed.
Consider the problem of designing an algorithm to automatically differentiate between apples and pears (class labels) given only their width and height measurements (features).
| Width | Height | Fruit |
|---|---|---|
| 7.1 | 7.3 | Apple |
| 7.9 | 7.5 | Apple |
| 7.4 | 7.0 | Apple |
| 8.2 | 7.3 | Apple |
| 7.6 | 6.9 | Apple |
| 7.8 | 8.0 | Apple |
| 7.0 | 7.5 | Pear |
| 7.1 | 7.9 | Pear |
| 6.8 | 8.0 | Pear |
| 6.6 | 7.7 | Pear |
| 7.3 | 8.2 | Pear |
| 7.2 | 7.9 | Pear |
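To make the computation concrete, here is a sketch in R that evaluates the candidate split “Width \(\leq\) 7.25” (a threshold chosen here purely for illustration) on the fruit data above, using base-2 entropy:

```r
# The apples/pears data from the table above
width <- c(7.1, 7.9, 7.4, 8.2, 7.6, 7.8, 7.0, 7.1, 6.8, 6.6, 7.3, 7.2)
fruit <- c(rep("Apple", 6), rep("Pear", 6))

# Base-2 entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Candidate split: Width <= 7.25
left  <- fruit[width <= 7.25]
right <- fruit[width >  7.25]

ig <- entropy(fruit) -
      (length(left)  / length(fruit)) * entropy(left) -
      (length(right) / length(fruit)) * entropy(right)
ig  # about 0.35 bits
```

The parent node has entropy 1 bit (6 apples, 6 pears); the split sends 1 apple and 5 pears left and 5 apples and 1 pear right, so each child is much purer and the gain is roughly 0.35 bits.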
Maximizing information gain is one possible criterion for choosing among splits.
To avoid excessive complexity, splitting is usually stopped when the “information gain does not compensate for the increase in complexity”.
In practice, stop splitting if: \[ \max _{s \in S} \Delta I(s, t)<\beta \] where \(\Delta I(s, t)\) represents the information gain associated with an optimal split \(s\) of node \(t\), and \(\beta\) is a pre-determined threshold.
The decision tree classifies new data points as follows.
If we use 0-1 loss, the class assignment rule is majority voting: each leaf node is assigned the class with the maximum posterior probability, i.e., the most frequent class in the node:
\[ \kappa(t)=\arg \max _{j} p(j \mid t) \]
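A tiny sketch of this plug-in rule in R, with made-up class counts for a hypothetical leaf node:

```r
# kappa(t): pick the class with the largest count (equivalently, the
# largest estimated posterior p(j | t)) in the node
assign_class <- function(counts) names(which.max(counts))

node_counts <- c(Apple = 5, Pear = 1)  # hypothetical leaf-node counts
assign_class(node_counts)              # "Apple"
```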
Let’s assume we have built a tree and have the classes assigned for the leaf nodes.
We want to estimate the classification error rate for this tree.
We need to introduce the resubstitution estimate \(r(t)\) for the probability of misclassification, given that a case falls into node \(t\). This is:
\[ r(t)=1-\max _{j} p(j \mid t)=1-p(\kappa(t) \mid t) \]
Writing \(R(t)=r(t)\,p(t)\), where \(p(t)\) is the probability that a case falls into node \(t\), the resubstitution error estimate for the whole tree is:
\[ R(T)=\sum_{t \in \tilde{T}} R(t) \]
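As a toy illustration with made-up node probabilities (not from any real tree), the resubstitution estimates can be computed directly:

```r
# Hypothetical leaf nodes: class posteriors p(j | t) and node probabilities p(t)
posteriors <- list(t1 = c(0.9, 0.1), t2 = c(0.3, 0.7), t3 = c(0.6, 0.4))
p_t        <- c(t1 = 0.5, t2 = 0.3, t3 = 0.2)

r_t <- sapply(posteriors, function(p) 1 - max(p))  # r(t) = 1 - max_j p(j | t)
R_T <- sum(r_t * p_t)                              # R(T) = sum of r(t) p(t) over leaves
R_T  # 0.22
```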
Trees grown by exhaustively searching for optimal splits tend to overfit: they fit the training data well but generalize badly and fail more often in prediction.
To reduce complexity and overfitting while keeping the tree as accurate as possible, tree pruning may be applied.
Essentially, pruning removes branches that are unlikely to improve the accuracy of the model on new data.
Start by building a large tree that overfits the data.
Then, use cross-validation to estimate the optimal value of the complexity parameter alpha that minimizes the generalization error.
Finally, prune the tree by removing the branches whose improvement in impurity is smaller than the penalty term multiplied by alpha.
The process is iterated until no more branches can be pruned, or until a minimum tree size is reached.
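The steps above can be sketched with rpart, whose complexity parameter `cp` plays the role of alpha; the built-in iris data stands in for real data:

```r
library(rpart)

# Step 1: grow a deliberately large tree that overfits
set.seed(1)
big <- rpart(Species ~ ., data = iris,
             control = rpart.control(cp = 0, minsplit = 2))

# Step 2: rpart's cptable stores the cross-validated error (xerror)
# for each candidate cp; pick the cp that minimizes it
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]

# Step 3: prune back to that complexity level
pruned <- prune(big, cp = best_cp)
```

A common refinement, also supported by the `cptable`, is the one-standard-error rule: choose the simplest tree whose `xerror` is within one `xstd` of the minimum.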
When the response variable is numeric, decision trees are called regression trees.
Some differences with classification trees are:
- Splits are chosen to minimize the residual sum of squares (RSS) within the resulting nodes, instead of an impurity measure.
- The prediction at each terminal node is the mean of the response values falling in that node, instead of a class label.
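A minimal regression-tree sketch with rpart on the built-in mtcars data (`method = "anova"` selects the RSS splitting criterion; the predictors chosen here are illustrative):

```r
library(rpart)

# Regression tree: predict fuel consumption from weight and horsepower
reg <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10))

# Predictions are numeric: each case gets the mean response of its leaf
predict(reg, mtcars[1:3, ])
```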
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. CRC press.
Greenwell, B. M. (2022). Tree-Based Methods for Statistical Learning in R (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781003089032
Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press.
Genuer, R., & Poggi, J. M. (2020). Random Forests with R. Springer (Use R!).
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). Springer.